Machine Learning: Prediction

Sebastián Pérez Saaibi

Foreword

This presentation intends to be an introduction to the practical aspects of Machine Learning for students of the Computational Methods course at Uniandes. The content below borrows heavily from Coursera’s Data Science Specialization by Jeff Leek, Roger Peng and Brian Caffo (Johns Hopkins University).

Mathematical Foundations of Prediction

Learning

Unsupervised Learning

Supervised Learning

ML Table

Prediction

In Sample vs Out of Sample Errors

# load data
library(kernlab); data(spam); set.seed(333)
# picking a small subset (10 values) from spam data set
smallSpam <- spam[sample(dim(spam)[1],size=10),]
# label spam = 2 and ham = 1
spamLabel <- (smallSpam$type=="spam")*1 + 1
# plot the capitalAve values for the dataset with colors differentiated by spam/ham (2 vs 1)
plot(smallSpam$capitalAve,col=spamLabel)

# first rule (over-fitting to capture all variation)
rule1 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.7] <- "spam"
    prediction[x < 2.40] <- "nonspam"
    prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
    prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 1 (in sample error --> no error in this case)
table(rule1(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    0
##   spam          0    5
# second rule (simple, setting a threshold)
rule2 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.8] <- "spam"
    prediction[x <= 2.8] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 2 (in sample error --> 10% in this case)
table(rule2(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    1
##   spam          0    4
# tabulate out of sample error for algorithm 1
table(rule1(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2141  588
##   spam        647 1225
# tabulate out of sample error for algorithm 2
table(rule2(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2224  642
##   spam        564 1171
# accuracy and total correct for algorithm 1 and 2
rbind("Rule 1" = c(Accuracy = mean(rule1(spam$capitalAve)==spam$type),
    "Total Correct" = sum(rule1(spam$capitalAve)==spam$type)),
    "Rule 2" = c(Accuracy = mean(rule2(spam$capitalAve)==spam$type),
    "Total Correct" = sum(rule2(spam$capitalAve)==spam$type)))
##         Accuracy Total Correct
## Rule 1 0.7315801          3366
## Rule 2 0.7378831          3395

Prediction Study Design

Sample Division Guidelines for Prediction Study Design

Picking the Right Data

Types of Errors

Notable Measurements for Error – Binary Variables

Notable Measurements for Error – Continuous Variables

Receiver Operating Characteristic Curves

Cross Validation

Random Subsampling

* a randomly sampled test set is subsetted out from the original training set
* the predictor is built on the remaining training data and applied to the test set
* this is repeated several times (e.g. three random subsamplings from the same training set) and the errors are averaged
* considerations
    - must be done without replacement
    - random sampling with replacement = bootstrap
        + underestimates the error
        + can be corrected, but it is complicated (0.632 Bootstrap)
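The bootstrap's underestimate of the error comes from sampling with replacement: on average only about \(1 - 1/e \approx 63.2\%\) of the original observations appear in each bootstrap sample, which is where the "0.632 Bootstrap" gets its name. A quick simulation of that fraction (an illustrative sketch in Python, not part of the R workflow below):

```python
import random

random.seed(42)
n = 10000                                     # size of the "training set"
# draw a bootstrap sample: n indices sampled with replacement
boot = [random.randrange(n) for _ in range(n)]
# fraction of distinct original observations that made it into the sample
frac_unique = len(set(boot)) / n
print(round(frac_unique, 3))                  # close to 1 - 1/e ~ 0.632
```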

K-Fold

* break the training set into \(K\) subsets (e.g. a 3-fold cross validation)
* for each subset, build the model/predictor on the remaining training data and apply it to the held-out test subset
* repeat this \(K\) times with the different training and test subsets and average the findings
* considerations
    - larger k = less bias, more variance
    - smaller k = more bias, less variance
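caret's createFolds (used later in this document) handles the index bookkeeping, but the core idea can be sketched by hand. `kfold_indices` below is a hypothetical helper that simply partitions row indices into contiguous, near-equal folds (no stratification by outcome):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds.

    Returns a list of (train, test) index-list pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits

splits = kfold_indices(10, 3)
for train, test in splits:
    print(test)   # each index appears in exactly one test fold
```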

Leave One Out

* leave out exactly one sample and build the predictor on the rest of the training data
* predict the value for the left-out sample
* repeat for each sample
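Leave-one-out is the \(K = n\) extreme of K-fold cross validation. A tiny sketch of the loop, using "predict the training mean" as a stand-in model on made-up data:

```python
data = [2.0, 4.0, 6.0, 8.0]
errors = []
for i, held_out in enumerate(data):
    train = data[:i] + data[i + 1:]          # leave exactly one sample out
    prediction = sum(train) / len(train)     # "model": predict the training mean
    errors.append((prediction - held_out) ** 2)
# average the squared errors over all n held-out samples
loocv_mse = sum(errors) / len(errors)
print(loocv_mse)
```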

caret Package (tutorial)

Data Slicing

# load packages and data
library(caret)
# create training set indexes with 75% of data
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
# subset spam data to training
training <- spam[inTrain,]
# subset spam data (the rest) to test
testing <- spam[-inTrain,]
# dimension of original and training dataset
rbind("original dataset" = dim(spam),"training set" = dim(training))
##                  [,1] [,2]
## original dataset 4601   58
## training set     3451   58
# create 10 folds for cross validation and return the training set indices
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)
# structure of the training set indices
str(folds)
## List of 10
##  $ Fold01: int [1:4141] 1 2 3 4 6 7 8 9 10 12 ...
##  $ Fold02: int [1:4140] 1 2 3 4 5 7 8 9 10 11 ...
##  $ Fold03: int [1:4141] 1 3 4 5 6 7 8 9 10 11 ...
##  $ Fold04: int [1:4141] 2 3 4 5 6 7 8 9 10 11 ...
##  $ Fold05: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold06: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold07: int [1:4141] 1 2 3 4 5 6 8 9 11 12 ...
##  $ Fold08: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold09: int [1:4141] 1 2 4 5 6 7 10 11 12 13 ...
##  $ Fold10: int [1:4141] 1 2 3 5 6 7 8 9 10 11 ...
# return the test set indices instead
# note: returnTrain = FALSE is unnecessary as it is the default behavior
folds.test <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=FALSE)
str(folds.test)
## List of 10
##  $ Fold01: int [1:460] 15 16 18 40 45 62 68 81 82 102 ...
##  $ Fold02: int [1:459] 1 41 55 58 67 75 117 123 151 175 ...
##  $ Fold03: int [1:461] 3 14 66 69 70 80 90 112 115 135 ...
##  $ Fold04: int [1:460] 5 19 25 65 71 83 85 88 91 93 ...
##  $ Fold05: int [1:460] 6 10 17 21 26 56 57 104 107 116 ...
##  $ Fold06: int [1:459] 7 8 13 39 52 54 76 89 99 106 ...
##  $ Fold07: int [1:461] 4 23 27 29 32 33 34 38 49 51 ...
##  $ Fold08: int [1:460] 2 9 30 31 36 37 43 46 47 48 ...
##  $ Fold09: int [1:461] 12 20 24 44 53 59 60 64 84 98 ...
##  $ Fold10: int [1:460] 11 22 28 35 42 61 72 86 92 118 ...
# return first 10 elements of the first training set
folds[[1]][1:10]
##  [1]  1  2  3  4  6  7  8  9 10 12
# create 10 resamples
resamples <- createResample(y=spam$type,times=10,list=TRUE)
# structure of the resamples (note some samples are repeated)
str(resamples)
## List of 10
##  $ Resample01: int [1:4601] 1 4 4 4 7 8 12 13 13 14 ...
##  $ Resample02: int [1:4601] 3 3 5 7 10 12 12 13 13 14 ...
##  $ Resample03: int [1:4601] 1 2 2 3 4 5 8 10 11 12 ...
##  $ Resample04: int [1:4601] 1 3 3 4 7 8 8 9 10 14 ...
##  $ Resample05: int [1:4601] 2 4 5 6 7 7 8 8 9 12 ...
##  $ Resample06: int [1:4601] 3 6 6 7 8 9 12 13 13 14 ...
##  $ Resample07: int [1:4601] 1 2 2 5 5 6 7 8 9 10 ...
##  $ Resample08: int [1:4601] 2 2 3 4 4 7 7 8 8 9 ...
##  $ Resample09: int [1:4601] 1 4 7 8 8 9 12 13 15 15 ...
##  $ Resample10: int [1:4601] 1 3 4 4 7 7 9 9 10 11 ...
# create time series data
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
# name of lists
names(folds)
## [1] "train" "test"
# first training set
folds$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# first test set
folds$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30

Training Options (tutorial)

# returns the arguments of the default train function
args(train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL, 
##     metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric == 
##         "RMSE", FALSE, TRUE), trControl = trainControl(), tuneGrid = NULL, 
##     tuneLength = 3) 
## NULL
# returns the default arguments for the trainControl object
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method), 
##     10, 25), repeats = ifelse(grepl("cv", method), 1, number), 
##     p = 0.75, initialWindow = NULL, horizon = 1, fixedWindow = TRUE, 
##     verboseIter = FALSE, returnData = TRUE, returnResamp = "final", 
##     savePredictions = FALSE, classProbs = FALSE, summaryFunction = defaultSummary, 
##     selectionFunction = "best", preProcOptions = list(thresh = 0.95, 
##         ICAcomp = 3, k = 5), index = NULL, indexOut = NULL, timingSamps = 0, 
##     predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5, 
##         alpha = 0.05, method = "gls", complete = TRUE), allowParallel = TRUE) 
## NULL

Plotting Predictors (tutorial)

# load relevant libraries
library(ISLR); library(ggplot2);
# load wage data
data(Wage)
# create training and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
# plot relationships between the predictors and outcome
featurePlot(x=training[,c("age","education","jobclass")], y = training$wage,plot="pairs")

# qplot plus linear regression lines
qplot(age,wage,colour=education,data=training)+geom_smooth(method='lm',formula=y~x)

# load Hmisc and gridExtra packages
library(Hmisc);library(gridExtra);
# cut the wage variable into 3 groups (by quantiles)
cutWage <- cut2(training$wage,g=3)
# plot the boxplot
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
      geom=c("boxplot"))
# plot boxplot and point clusters
p2 <- qplot(cutWage,age, data=training,fill=cutWage,
      geom=c("boxplot","jitter"))
# plot the two graphs side by side
grid.arrange(p1,p2,ncol=2)

# tabulate the cutWage and jobclass variables
t <- table(cutWage,training$jobclass)
# print table
t
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 92.7)           445            256
##   [ 92.7,119.1)           376            347
##   [119.1,318.3]           269            409
# convert to proportion table based on the rows
prop.table(t,1)
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 92.7)     0.6348074      0.3651926
##   [ 92.7,119.1)     0.5200553      0.4799447
##   [119.1,318.3]     0.3967552      0.6032448
# produce density plot
qplot(wage,colour=education,data=training,geom="density")

Preprocessing (tutorial)

# load spam data
data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preProcess object for all predictors ("-58" because 58th = outcome)
preObj <- preProcess(training[,-58],method=c("center","scale"))
# normalize training set
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# normalize test set using training parameters
testCapAveS <- predict(preObj,testing[,-58])$capitalAve
# compare results for capitalAve variable
rbind(train = c(mean = mean(trainCapAveS), std = sd(trainCapAveS)),
    test = c(mean(testCapAveS), sd(testCapAveS)))
##               mean      std
## train 6.362510e-18 1.000000
## test  7.548133e-02 1.633866
library(e1071)
# set up BoxCox transforms
preObj <- preProcess(training[,-58],method=c("BoxCox"))
# perform preprocessing on training data
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# plot histogram and QQ Plot
# Note: the transformation definitely helped to
# normalize the data but does not produce a perfect result
par(mfrow=c(1,2)); hist(trainCapAveS); qqnorm(trainCapAveS)

# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
# compute differences between imputed values and true values
quantile(capAve - capAveTruth)
##            0%           25%           50%           75%          100% 
## -1.8242434686 -0.0005069839  0.0007532392  0.0013598891  0.4062295931

Covariate Creation/Feature Extraction

Creating Dummy Variables

# setting up data
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
# create the dummy variable columns
head(predict(dummies,newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 86582                       0                       1
## 11443                       0                       1
## 305706                      1                       0
## 160191                      1                       0
## 448410                      1                       0

Removing Zero Covariates

# print nearZeroVar table
nearZeroVar(training,saveMetrics=TRUE)
##            freqRatio percentUnique zeroVar   nzv
## year        1.017647    0.33301618   FALSE FALSE
## age         1.231884    2.85442436   FALSE FALSE
## sex         0.000000    0.04757374    TRUE  TRUE
## maritl      3.329571    0.23786870   FALSE FALSE
## race        8.480583    0.19029496   FALSE FALSE
## education   1.393750    0.23786870   FALSE FALSE
## region      0.000000    0.04757374    TRUE  TRUE
## jobclass    1.070936    0.09514748   FALSE FALSE
## health      2.526846    0.09514748   FALSE FALSE
## health_ins  2.209160    0.09514748   FALSE FALSE
## logwage     1.011765   18.83920076   FALSE FALSE
## wage        1.011765   18.83920076   FALSE FALSE

Creating Splines (Polynomial Functions)

# load splines package
library(splines)
# create polynomial function
bsBasis <- bs(training$age,df=3)
# fit the outcome on the three polynomial terms
lm1 <- lm(wage ~ bsBasis,data=training)
# plot all age vs wage data
plot(training$age,training$wage,pch=19,cex=0.5)
# plot the fitted polynomial function
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)

# predict on test values
# note: predict.bs expects the argument `newx`; `age=` is silently ignored here,
# so this actually returns the basis evaluated at the training ages
head(predict(bsBasis,age=testing$age))
##               1          2           3
## [1,] 0.00000000 0.00000000 0.000000000
## [2,] 0.23685006 0.02537679 0.000906314
## [3,] 0.36252559 0.38669397 0.137491189
## [4,] 0.42616898 0.14823269 0.017186399
## [5,] 0.44221829 0.19539878 0.028779665
## [6,] 0.01793746 0.20448709 0.777050955

Multicore Parallel Processing

Preprocessing with Principal Component Analysis (PCA)

prcomp Function

# load  spam data
data(spam)
# perform PCA on dataset
prComp <- prcomp(log10(spam[,-58]+1))
# print the first 5 rows of the rotation matrix (eigenvectors) for the first 5 PCs
head(prComp$rotation[, 1:5], 5)
##                 PC1           PC2         PC3         PC4          PC5
## make    0.019370409  0.0427855959 -0.01631961  0.02798232 -0.014903314
## address 0.010827343  0.0408943785  0.07074906 -0.01407049  0.037237531
## all     0.040923168  0.0825569578 -0.03603222  0.04563653  0.001222215
## num3d   0.006486834 -0.0001333549  0.01234374 -0.01005991 -0.001282330
## our     0.036963221  0.0941456085 -0.01871090  0.05098463 -0.010582039
# create new variable that marks spam as 2 and nonspam as 1
typeColor <- ((spam$type=="spam")*1 + 1)
# plot the first two principal components
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")

caret Package

# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# run model on outcome and principal components
modelFit <- train(training$type ~ .,method="glm",data=trainPC)
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     656   41
##    spam         82  371
##                                           
##                Accuracy : 0.893           
##                  95% CI : (0.8737, 0.9103)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7724          
##  Mcnemar's Test P-Value : 0.0003101       
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.9005          
##          Pos Pred Value : 0.9412          
##          Neg Pred Value : 0.8190          
##              Prevalence : 0.6417          
##          Detection Rate : 0.5704          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.8947          
##                                           
##        'Positive' Class : nonspam         
## 
# construct model
modelFit <- train(training$type ~ .,method="glm",preProcess="pca",data=training)
# print results of model
confusionMatrix(testing$type,predict(modelFit,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     668   29
##    spam         59  394
##                                           
##                Accuracy : 0.9235          
##                  95% CI : (0.9066, 0.9382)
##     No Information Rate : 0.6322          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8379          
##  Mcnemar's Test P-Value : 0.001992        
##                                           
##             Sensitivity : 0.9188          
##             Specificity : 0.9314          
##          Pos Pred Value : 0.9584          
##          Neg Pred Value : 0.8698          
##              Prevalence : 0.6322          
##          Detection Rate : 0.5809          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.9251          
##                                           
##        'Positive' Class : nonspam         
## 

Predicting with Regression

R Commands and Examples

# load data
data(faithful)
# create train and test sets
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
# build linear model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
# print summary of linear model
summary(lm1)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30541 -0.35970  0.04711  0.37624  1.07683 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.870699   0.225663   -8.29 1.01e-13 ***
## waiting      0.075674   0.003137   24.12  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5024 on 135 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8103 
## F-statistic: 581.8 on 1 and 135 DF,  p-value: < 2.2e-16
# predict eruptions for new waiting time
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
##        1 
## 4.183192
# create 1 x 2 panel plot
par(mfrow=c(1,2))
# plot train data with the regression line
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",
    ylab="Duration", main = "Train")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# plot test data with the regression line
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",
    ylab="Duration", main = "Test")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

# calculate root sum of squared errors on training and test sets
# (sqrt of the SSE; the conventional RMSE would divide by n inside the square root)
c(trainRMSE = sqrt(sum((lm1$fitted-trainFaith$eruptions)^2)),
    testRMSE = sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2)))
## trainRMSE  testRMSE 
##  5.836980  5.701161
# calculate prediction interval
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
# plot data points (eruptions, waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
# plot fit line and prediction interval
matlines(testFaith$waiting,pred1,type="l",col=c(1,2,2),lty = c(1,1,1), lwd=3)

# create train and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# fit linear model for age jobclass and education
modFit<- train(wage ~ age + jobclass + education,method = "lm",data=training)
# store final model
finMod <- modFit$finalModel
# set up 2 x 2 panel plot
par(mfrow = c(2, 2))
# construct diagnostic plots for model
plot(finMod,pch=19,cex=0.5,col="#00000010")

# plot residual by index
plot(finMod$residuals,pch=19,cex=0.5)

* here the residuals increase with the row index and the largest residuals are concentrated at the highest indexes, which suggests the rows are ordered by some variable (e.g. time) that is missing from the model

Prediction with Trees

Process

  1. start with all variables in one group
  2. find the variable that best splits the outcomes into two groups
  3. divide data into two groups (leaves) based on the split performed (node)
  4. within each split, find variables to split the groups again
  5. continue this process until all groups are sufficiently small/homogeneous/“pure”
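Step 2, finding the best split, can be sketched for a single numeric predictor by scanning candidate thresholds and minimizing the size-weighted Gini impurity (an illustrative Python sketch with made-up data and helper names, not rpart's actual implementation):

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Return the threshold on x minimizing the size-weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(x)):
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi >  t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if right and score < best_score:      # skip the degenerate "no split"
            best_t, best_score = t, score
    return best_t, best_score

# toy 1-D data: two classes that separate cleanly (cf. the iris example below)
x = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 5.9]
y = ["setosa"] * 3 + ["versicolor"] * 4
print(best_split(x, y))   # a pure split: weighted impurity 0
```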

Measures of Impurity (Reference)

\[\hat{p}_{mk} = \frac{\sum_{x_i \in R_m} \mathbb{1}(y_i = k)}{N_m}\]

where \(R_m\) is leaf (region) \(m\) and \(N_m\) is the number of training observations in it.
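For reference, the impurity measures used in the worked example below, written in standard form in terms of \(\hat{p}_{mk}\):

\[\text{Misclassification error: } 1 - \hat{p}_{mk(m)}, \text{ where } k(m) \text{ is the most common class in leaf } m\]

\[\text{Gini index: } \sum_{k} \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k} \hat{p}_{mk}^2\]

\[\text{Information (deviance): } -\sum_{k} \hat{p}_{mk} \log_2 \hat{p}_{mk}\]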

# set margin and seed
par(mar=c(1,1,1,1), mfrow = c(1, 2)); set.seed(1234);
# simulate data
x = rep(1:4,each=4); y = rep(1:4,4)
# plot first scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",15),rep("red",1)),pch=19)
# plot second scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",8),rep("red",8)),pch=19)

* left graph
    - Misclassification: \(\frac{1}{16} = 0.06\)
    - Gini: \(1 - [(\frac{1}{16})^2 + (\frac{15}{16})^2] = 0.12\)
    - Information: \(-[\frac{1}{16} \times \log_2 (\frac{1}{16})+ \frac{15}{16} \times \log_2(\frac{15}{16})] = 0.34\)
* right graph
    - Misclassification: \(\frac{8}{16} = 0.5\)
    - Gini: \(1 - [(\frac{8}{16})^2 + (\frac{8}{16})^2] = 0.5\)
    - Information: \(-[\frac{8}{16} \times \log_2 (\frac{8}{16})+ \frac{8}{16} \times \log_2(\frac{8}{16})] = 1\)
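These impurity values can be verified numerically (a quick check; the values above match after rounding to two decimals):

```python
from math import log2

def misclassification(ps):
    return 1 - max(ps)

def gini(ps):
    return 1 - sum(p * p for p in ps)

def information(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

left  = [1/16, 15/16]   # 15 blue, 1 red
right = [8/16, 8/16]    # 8 blue, 8 red

print(round(misclassification(left), 2),  round(gini(left), 2),  round(information(left), 2))
print(round(misclassification(right), 2), round(gini(right), 2), round(information(right), 2))
```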

Constructing Trees with caret Package

library(rattle)
library(rpart)
# load iris data set
data(iris)
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.65 34  1 versicolor (0.00000000 0.97058824 0.02941176) *
##     7) Petal.Width>=1.65 36  2 virginica (0.00000000 0.05555556 0.94444444) *
# plot the classification tree
rattle::fancyRpartPlot(modFit$finalModel)

# predict on test values
predict(modFit,newdata=testing)
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] versicolor virginica  virginica  versicolor versicolor virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Bagging

# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
    # create sample from data with replacement
    ss <- sample(1:dim(ozone)[1],replace=T)
    # draw sample from the data and reorder rows based on ozone
    ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
    # fit loess function through data (similar to spline)
    loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
    # prediction from loess curve for the same values each time
    ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)

Bagging Algorithms

# load relevant package and data
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
                # custom bagging function
                bagControl = bagControl(fit = ctreeBag$fit,
                                        predict = ctreeBag$pred,
                                        aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")

Random Forest

* process
    - bootstrap samples from the training data (with replacement)
    - at each split, bootstrap (subsample) the candidate variables
    - grow many trees (repeating the split/bootstrap steps) and vote/average over the final trees
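The final vote step can be sketched as a majority vote over the per-tree predictions (the votes below are hypothetical, not from a fitted model):

```python
from collections import Counter

def majority_vote(per_tree_predictions):
    """Combine one prediction per tree into a single class by majority vote."""
    return Counter(per_tree_predictions).most_common(1)[0][0]

# hypothetical predictions from 5 bootstrapped trees for one test flower
votes = ["versicolor", "virginica", "versicolor", "versicolor", "virginica"]
print(majority_vote(votes))   # versicolor
```

For a continuous outcome the aggregation step would average the per-tree predictions instead of voting.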

R Commands and Examples

library(randomForest)
# load data
data(iris)
# create train/test data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# apply random forest
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
# return the second tree (first 6 rows)
head(getTree(modFit$finalModel,k=2))
##   left daughter right daughter split var split point status prediction
## 1             2              3         3        2.60      1          0
## 2             0              0         0        0.00     -1          1
## 3             4              5         4        1.65      1          0
## 4             6              7         3        5.25      1          0
## 5             0              0         0        0.00     -1          3
## 6             0              0         0        0.00     -1          2
# compute cluster centers
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
# convert irisP to data frame and add Species column
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
# plot data points
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
# add the cluster centers
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)

# predict outcome for test data set using the random forest model
pred <- predict(modFit,testing)
# logical value for whether or not the rf algorithm predicted correctly
testing$predRight <- pred==testing$Species
# tabulate results
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         1
##   virginica       0          2        14
# plot data points with the incorrect classification highlighted
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")

Boosting

* we start with a space of blue + and red - points, and the goal is to classify all the objects correctly
* only straight lines will be used for classification

* from the above, we can see that a group of weak predictors (lines in this case) can be combined and weighted to become a much stronger predictor
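A minimal numeric illustration of that idea, with hypothetical 1-D data and hand-picked weights (an actual boosting algorithm such as AdaBoost would learn the rules and weights iteratively): each threshold rule alone misclassifies some points, but a weighted vote of the three classifies everything.

```python
# labels: +1 in the two outer regions, -1 in the middle
points = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [+1, +1, -1, -1, -1, -1, +1, +1]

# three weak "straight line" (threshold) classifiers
weak = [
    lambda x: +1 if x <= 2 else -1,   # catches the left +1 region
    lambda x: +1 if x >= 7 else -1,   # catches the right +1 region
    lambda x: +1,                     # constant classifier, breaks ties
]
weights = [0.6, 0.6, 0.3]             # hand-picked for this illustration

def accuracy(classify):
    return sum(classify(x) == y for x, y in zip(points, labels)) / len(points)

def combined(x):
    score = sum(w * h(x) for w, h in zip(weights, weak))
    return +1 if score > 0 else -1

print([accuracy(h) for h in weak])   # each weak rule alone is mediocre
print(accuracy(combined))            # the weighted vote is perfect: 1.0
```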